Documentation - Becoming an economist: France

Authors
Affiliations

Thomas Delcey

Université de Bourgogne

Aurelien Goutsmedt

UC Louvain; ICHEC

Published

November 30, 2024

Introduction

This database gathers information on Ph.D. theses in economics defended in France since 1900.1 The output is a relational database that connects several data frames. It is structured in four main components:

  • Thesis Metadata: This component contains core details about each thesis. Each line represents a thesis record, including information such as the title, defense date, abstract, and other relevant details.
  • Edges: This table links the three other tables, connecting individuals and institutions to theses.
  • Institutions Data: This includes all mentioned universities, laboratories, doctoral schools, and other institutions associated with the theses.
  • Individual Data: Each line represents an individual involved in the thesis, such as authors, supervisors, or jury members.

Usage and Access

Our database is under an open licence (WHICH ONE). It can be accessed and used freely by anyone. The data is stored in the XXX repository. Note that we focus on Ph.D. theses in economics and query our sources by the field of the thesis. However, the scripts have been developed with some flexibility and can be adapted to other queries, for instance for other disciplines.

If you use our data or our scripts, please cite us using the following bib reference:

@article{delcey2022becoming,
  title={Becoming an Economist: A Database of French Economics PhDs},
  author={Goutsmedt, Aurelien and Delcey, Thomas},
  journal={Working Paper},
  year={2024}
}
Warning

Please note that some of our cleaning steps result from the specificities of the data we extracted. We systematically checked for problems in our data and cleaned them manually (removing problematic titles, identifying duplicates, cleaning institutions, etc.). If you use our code to extract data, you should check the quality of the data you have extracted and adapt the cleaning steps to your data. Don’t hesitate to ask us for guidance if you are using our code to extract similar data.

Presentation of the tables

Thesis Metadata

The thesis metadata table contains 16 variables:

  • thesis_id: the unique identifier of the thesis. If it exists, it is the official “national number of the thesis” created by the Agence Bibliographique de l’Enseignement Supérieur (ABES) and the theses.fr website. If not, it is a temporary identifier we have created.
  • year_defence: the year of the thesis defence. It covers the period between 1899 and 2023.
  • language_1 and language_2: the languages of the thesis. These are harmonized variables based on the language information found in SUDoc and Theses.fr.
  • title_fr: the title of the thesis in French.
  • title_en: the title of the thesis in English.
  • title_other: the title of the thesis in another language.
  • abstract_fr: the abstract of the thesis in French.
  • abstract_en: the abstract of the thesis in English.
  • abstract_other: the abstract of the thesis in another language.
  • field: the field of the thesis. It is a harmonized variable based on the respective field variables found in SUDoc and Theses.fr.
  • accessible: a binary variable indicating whether the full text is accessible or not.
  • type: the type of the thesis. It can take 6 values: Thèse, Thèse d’État, Thèse complémentaire, Thèse de 3e cycle, Thèse de docteur-ingénieur, Thèse sur travaux. All categories are found in SUDoc (see the note “The variable type” below).
  • country: the country where the thesis was defended.
  • url: the URL of the thesis, linking to the theses.fr website or the sudoc.fr website.
  • duplicate_of: a list of identifiers of the theses that are duplicates of this record.
Warning

The raw source data had a high error rate in the language of titles and abstracts: French titles were in the English column and vice versa. We corrected this issue using language prediction models. See Section 3.4.4 for details.

Table 1 shows a sample of the thesis metadata table. The thesis metadata table contains 21025 theses. Figure 1 shows the distribution of theses over time.

Table 1: Sample of the metadata table
Figure 1: Distribution of theses by defense date
Figure 2: Distribution of theses by defense date and type of thesis
Figure 3: Availability of abstracts
The variable type

Note that the French education system did not have a harmonized Ph.D. system between the early 1960s and 1984, the date of the Savary reform that unified the doctorate. During this period, various types of theses coexisted. In the mid-1970s, it was usual to start with a “Doctorat de 3e cycle” before undertaking a “Doctorat d’État”. Thus, one author can have several theses of different types. Figure 2 shows the distribution of theses over time by type of thesis. Note also that we cannot guarantee that the type of thesis is systematically mentioned in the metadata: it depends on the quality of the metadata provided by the institutions.

Availability of abstracts

The practice of providing abstracts in the metadata started in the 1980s. Before then, almost no abstracts are available (see Figure 3).

Edges

Each line in the edge table is a unique edge between a thesis and an entity. We define an entity as any individual or institution involved in the thesis. The edge table has 5 variables:

  • thesis_id: the identifier of a thesis (the same as in the thesis metadata table). In the edge table, a thesis_id can have several edges. A thesis_id has at least two edges: the author and the institution in which the thesis was defended.
  • entity_id: the identifier of an entity.
  • entity_role: the role of the entity. A person can be an author, a supervisor, a referee, a president, or a member of the jury. In addition to the main institution in which the Ph.D. was defended, entity_role can contain additional information we were able to collect, such as other institutions, laboratories, and doctoral schools (the institutions organizing doctoral studies in French universities). Note that this only concerns theses collected in Theses.fr after 1985. For SUDoc, the value etablissements_soutenance_from_info may provide additional information on the institution.
  • entity_name: the name of the entity. Each entity has an entity_name. Note that the entity identifier is unique but the entity name is not: for instance, two different persons can have the same name.
  • entity_firstname: the first name of the entity. When available, an individual has an entity_firstname.
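To illustrate how the edge table connects the other tables, here is a minimal sketch in Python (the project's own scripts are written in R). The rows, the identifiers, and the role labels ("author", "supervisor", "institution") are hypothetical. Only the column names come from the documentation above.

```python
from collections import Counter

# Hypothetical miniature tables; only the column names come from the documentation.
thesis_metadata = [
    {"thesis_id": "T1", "year_defence": 1975, "type": "Thèse de 3e cycle"},
    {"thesis_id": "T2", "year_defence": 1992, "type": "Thèse"},
]
edges = [
    {"thesis_id": "T1", "entity_id": "P1", "entity_role": "author"},
    {"thesis_id": "T1", "entity_id": "I1", "entity_role": "institution"},
    {"thesis_id": "T2", "entity_id": "P2", "entity_role": "author"},
    {"thesis_id": "T2", "entity_id": "P1", "entity_role": "supervisor"},
    {"thesis_id": "T2", "entity_id": "I1", "entity_role": "institution"},
]


def theses_by_role(edges, thesis_metadata, entity_id, role):
    """Return the metadata rows of the theses in which `entity_id` holds `role`."""
    metadata_by_id = {row["thesis_id"]: row for row in thesis_metadata}
    return [
        metadata_by_id[edge["thesis_id"]]
        for edge in edges
        if edge["entity_id"] == entity_id
        and edge["entity_role"] == role
        and edge["thesis_id"] in metadata_by_id
    ]


# Theses supervised by the (hypothetical) individual P1.
supervised = theses_by_role(edges, thesis_metadata, "P1", "supervisor")

# Number of edges per role for the same individual.
roles_of_p1 = Counter(e["entity_role"] for e in edges if e["entity_id"] == "P1")
```

The same lookup pattern applies to the institution and individual tables, which share the entity_id key with the edge table.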
Warning

Most of our effort in building the database went into removing duplicate entities so that users can easily estimate the involvement of an entity in theses. This works for most institutions, which are well identified by a unique idref. Unfortunately, it was harder to disambiguate individuals. To illustrate this point, it is very easy to spot that the strings “Université Paris I” and “Université Paris I Panthéon-Sorbonne” refer to the same entity, but we cannot be sure that the “Thomas Delcey” authoring a Ph.D. in 2021 is the same person as the “Thomas Delcey” supervising a Ph.D. in 2022. The variable homonym_of helps users spot potential duplicates. See details in Section 3.4.6.

Table 2 shows a sample of the thesis edge table. We identify 91057 edges in total. Figure 4 shows the distribution of individuals by role. Figure 5 shows the distribution of individuals for the top institutions.

Table 2: Sample of the edges table
Figure 4: Top role
Figure 5: Top institutions

Institutions

The thesis institution table contains 1790 institutions. Institutions are the universities, laboratories, doctoral schools, and other institutions associated with the theses.

The thesis institution table contains 19 variables. It consists of two core variables:

  • entity_id: the unique identifier of the entity (here the institution).
  • entity_name: the name of the entity.

The other variables are additional information on the institution provided by the IdRef database:

  • url: the url of the entity.
  • scraped_id: the identifier of the entity in the scraped data.
  • pref_name: the preferred name of the entity.
  • other_labels: other labels of the entity.
  • country: the country of the entity.
  • date_of_birth: the date of creation of the entity.
  • date_of_death: the date of dissolution of the entity.
  • information: additional information on the entity.
  • replaced_idref: the identifier of the entity that replaced the entity.
  • predecessor: the predecessor of the entity.
  • predecessor_idref: the identifier of the predecessor of the entity.
  • successor: the successor of the entity.
  • successor_idref: the identifier of the successor of the entity.
  • subordinated: the subordinated entity.
  • subordinated_idref: the identifier of the subordinated entity.
  • unit_of: the unit of the entity.
  • unit_of_idref: the identifier of the unit of the entity.
  • other_link: other links of the entity.
  • info: additional information on the entity.
  • country_name: the country name of the entity.

Table 3 shows a sample of the thesis institution table.

Table 3: Sample of the thesis institution table

Individuals

The thesis person table contains 16 variables. The five core variables are:

  • entity_id: the unique identifier of the individual.
  • entity_name: the name of the individual.
  • entity_firstname: the first name of the individual.
  • gender: the gender of the individual according to the IdRef database.
  • gender_expanded: the gender of the individual according to the IdRef database augmented for missing values with the French census data (see details in Section 3.4.5).

The other variables are additional information on the individual provided by the IdRef database:

  • birth: the birth date of the individual.
  • country: the country of the individual.
  • info: additional information on the individual.
  • organization: the organization of the individual.
  • last_date_org: the last date of the organization.
  • start_date_org: the start date of the organization.
  • end_date_org: the end date of the organization.
  • other_link: other links of the individual.
  • country_name: the country name of the individual.
  • homonym_of: the identifier of the homonyms in the database.

Table 4 shows a sample of the thesis person table.

Table 4: Sample of the thesis person table
Figure 6: Distribution of individuals by gender
Figure 7: Distribution of individuals by country (top 10 excluding France)

Data collection and cleaning process

The data collection process is divided into two main steps:

  • Scraping: The first step consists of scraping data from the three main sources: Theses.fr, SUDoc, and IdRef.
  • Cleaning: The second step involves cleaning the raw data files to create the final database.

We give here a general presentation, focusing on the methodological choices we made.

General presentation

Scraping

The data used in this project come from three main sources: Theses.fr, SUDoc, and IdRef, each described below.

These sources are the result of the work of ABES (Agence bibliographique de l’enseignement supérieur), which produces metadata and APIs on research and higher education. The data from these three sources are under the Etalab “Open Licence”.2

  • Theses.fr is a comprehensive repository of PhD dissertations defended in French institutions since 1985.3 It includes metadata such as the title of the dissertation, author, date of defense, institution, supervisor, abstract, etc. The database covers a wide range of disciplines and provides access, in some cases, to digital theses.

  • SUDoc stands for Système Universitaire de Documentation. It is a union catalog that includes references to various documents held in French academic and research libraries. It covers books, journal articles, dissertations, and other academic works. The SUDoc database includes metadata such as title, author, publication date, and the libraries where the documents can be found. It is a key resource for academic research in France, providing a broad overview of available scholarly materials. For PhDs, it allows us to find dissertations defended before 1985 and to recover the relevant metadata.

  • IdRef stands for Identifiants et Référentiels pour l’Enseignement supérieur et la Recherche. It is a database focused on managing and standardizing the names and identifiers of authors and other contributors to academic and research works. It provides authority control for names used in academic cataloging, ensuring consistency and aiding in accurate attribution of works. IdRef is used in conjunction with SUDoc and other databases to support the management of bibliographic data in the French higher education and research sectors. In our project, it allows us to find additional data on individuals and institutions.

Data collection

theses.fr

Thesis records have been registered in theses.fr since 1985. Theses.fr data are also stored on the data.gouv.fr website. They can be downloaded directly at this URL. The downloading_theses_fr.R script downloads the .csv file from data.gouv.fr, then compresses and stores it in .rds format.

SUDoc

We systematically collect metadata on French dissertations archived in the SUDoc database, focusing on theses in economics through two distinct query strategies:

  • First query: We search for dissertations with a term starting with “econo” in the “Note de Thèse” field, which denotes the thesis discipline. This keyword captures terms like “économie” or “Economique” since SUDoc’s search function is case-insensitive and ignores accents. The time frame is limited to 1900–1985, as dissertations from later years are systematically cataloged in Theses.fr. Here is the query used to retrieve the thesis records.

  • Second query: We search for dissertations where “droit” (law) is specified in the “Note de Thèse” field, and where a term starting with “econo” appears in the title. This search is limited to 1900–1968 to capture dissertations classified as law theses before 1968 that likely focus on economics. Here is the query used to retrieve the thesis records.

The scraping_sudoc_id.R script collects the URLs of the thesis records. Then, the scraping_sudoc_api.R script queries the SUDoc API to retrieve structured metadata for each thesis, including information such as title, author, defence date, abstract, supervisor, and other relevant details. These metadata are stored in an .xml file, which we then parse to extract the relevant information. The .xml is structured according to “tags” and “codes” that are explained here.

Note

scraping_sudoc_api.R uses parallel processing to speed up data collection. The script is designed to handle errors and exceptions, ensuring robust data collection. It can be easily adapted to other queries.

IdRef

We use the idref identifiers collected in the SUDoc and Theses.fr sources to retrieve additional information on individuals (e.g., date of birth, nationality, gender, last known institutions) and institutions (e.g., preferred and alternate names, years of existence). The scraping_idref_person.R and scraping_idref_institution.R scripts take the idref identifiers as input, query this information through the IdRef API, and organize it in structured tables.

Cleaning

Our data-cleaning approach focuses on ensuring consistency and quality while preserving the integrity of the original data. We rely on two principles to build this database:

  • no data transformation: our work is mainly one of data collection, categorization, and cleaning. We transformed the data as little as possible, limiting ourselves to minimal transformations with no impact on content. In other words, we did not alter the cell values of the original data, in particular the encoded variables.

  • disambiguation: we tried to disambiguate the entities (theses, authors, etc.) as much as possible. Disambiguation refers to the process of identifying and distinguishing between different entities that may have the same name. We tried to provide a unique identifier for each entity. The identifiers of the Agence Bibliographique de l’Enseignement Supérieur (ABES) were the main source of unique identifiers (idref, nnt, etc.). When these identifiers were not available or disambiguation was not possible, we created our own temporary unique identifiers.

The first step is to clean the data from the raw sources and harmonize the data structure to facilitate the merging of the two datasets. The output of this step is a database dividing information into four tables: metadata, persons, institutions, and their relationships (edges).

SUDoc

The cleaning process for SUDoc data in 1_FR_sudoc_cleaning.R has two main objectives. First, it manages duplicate identifiers. Second, it transforms the raw SUDoc sources into a structured dataset: we evaluate the quality of the data, then structure the raw source to ensure consistency and prepare the merging with the Theses.fr data.

Duplicate Management

The script manages duplicate identifiers, which fall into two categories:

  • True duplicates: These occur when the same thesis is listed multiple times with identical identifiers and authors but differing defence dates. The process retains the most recent record as it is more likely to reflect the correct metadata.
  • False duplicates: These occur when the same identifier is shared by different authors, often due to data entry errors. To resolve these, unique identifiers are created by appending a counter to the nnt, ensuring data integrity without introducing ambiguity.
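The two cases above can be sketched as follows in Python (the actual script is written in R). This is only an illustration under stated assumptions: the column names nnt, author, and year_defence stand in for the raw fields, "most recent record" is read as the latest defence year, and the "_1", "_2" suffix format is hypothetical.

```python
from collections import defaultdict


def resolve_duplicate_ids(records):
    """Resolve duplicated identifiers in a list of thesis records.

    Each record is a dict with the (assumed) keys 'nnt', 'author', 'year_defence'.
    """
    by_id = defaultdict(list)
    for rec in records:
        by_id[rec["nnt"]].append(rec)

    resolved = []
    for nnt, group in by_id.items():
        by_author = defaultdict(list)
        for rec in group:
            by_author[rec["author"]].append(rec)
        # True duplicates: same identifier and author, keep the most recent record.
        kept = [max(recs, key=lambda r: r["year_defence"]) for recs in by_author.values()]
        if len(kept) == 1:
            resolved.append(dict(kept[0]))
        else:
            # False duplicates: same identifier, different authors.
            # A counter is appended to the identifier (hypothetical suffix format).
            for counter, rec in enumerate(kept, start=1):
                resolved.append({**rec, "nnt": f"{nnt}_{counter}"})
    return resolved


# Illustrative records only.
records = [
    {"nnt": "A", "author": "Dupont", "year_defence": 1970},
    {"nnt": "A", "author": "Dupont", "year_defence": 1972},
    {"nnt": "B", "author": "Martin", "year_defence": 1980},
    {"nnt": "B", "author": "Durand", "year_defence": 1981},
]
resolved = resolve_duplicate_ids(records)
```

In this toy input, identifier A is a true duplicate and collapses to one record, while identifier B is shared by two authors and is split into two distinct identifiers.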

Data Standardization

Most of the variables of the final data are created here from the raw data. Two variables deserve particular attention:

  • year_defence: For some theses, we retrieve multiple dates of defence. We choose the oldest date for theses with multiple dates, as the earliest date is more likely to reflect an unfinished thesis. We also check manually the cases where the dates are not close to each other, and we clean anomalous dates outside the query range (1899–1985).
  • type: Another important variable created here is the type of the thesis, since the French system had various kinds of theses between the 1960s and the 1984 reform. We use different raw sources of SUDoc metadata to identify the type of thesis. Thesis types are recoded into consistent categories (e.g., “Thèse d’État”, “Thèse de 3e cycle”). Records that are not doctoral theses (e.g., master’s dissertations) are filtered out to focus exclusively on relevant entries. Note that if we cannot identify a particular type of thesis, the variable takes the generic value “Thèse”. Language codes are also standardized to align with ISO conventions and ensure compatibility with the Theses.fr data.
Warning

The value “Thèse” of the type variable is the default value used when we cannot identify a particular type of thesis.

The final dataset is split into the four tables that make up the relational database (metadata, edge, person, and institution). Temporary IDs are generated for entities without official identifiers to facilitate later identification and disambiguation.

Theses.fr

The 2_FR_thesesfr_cleaning.R script is dedicated to cleaning and structuring the metadata of economics theses extracted from the Theses.fr database. The strategy is the same as for SUDoc: checking the quality of the data, transforming the raw sources into a structured dataset, and preparing this dataset for integration with the SUDoc data. The only particular point in this script is that we had to remove some theses that were not related to economics but were wrongly categorized as such in our query. We then follow the same steps as for the SUDoc data, categorizing and harmonizing variables to prepare the merging. Again, temporary IDs are generated for entities without official identifiers to facilitate later identification and disambiguation.

Merging

The 3_FR_merging_database.R script merges the sets of tables created from the SUDoc and Theses.fr sources. There is no particular difficulty in this script. We do not handle duplicates here, as this is done in the next steps.

Metadata

The 4_FR_cleaning_thesis_metadata.R script is designed to clean the metadata. The SUDoc and Theses.fr sources gather information entered by various local institutions and individuals, leading to inconsistencies and errors. The script addresses several key challenges:

  • Language detection: Language consistency is verified across the metadata using both the cld3 [@R-cld3] and fastText [@fastText2016b] models for robust identification. Titles and abstracts are checked to ensure that the French and English columns contain text in the intended language. Discrepancies are resolved by reassigning the text to the correct field. When a French or English title or abstract is missing, the script uses the auxiliary columns originally scraped (title_other and abstract_other) to fill the gap when relevant. Titles and abstracts written in full uppercase are converted to sentence case to enhance readability. Placeholder text and irrelevant symbols are also removed, with uninformative entries replaced by missing values (NA).
  • Duplicates: We found many duplicate thesis records in the metadata. This is explained both by the fact that the same thesis can be registered in both SUDoc and Theses.fr and by the fact that the same thesis can be registered several times in the same database by different institutions. We manage duplicates with a duplicate detection algorithm. The core of the detection process consists of grouping titles by author and comparing all possible pairs of titles within each group. The Optimal String Alignment (OSA) distance is used as the primary metric for this comparison. OSA is a variant of the Levenshtein distance that counts the minimum number of operations needed to turn one string into the other (character insertions, deletions, substitutions, and transpositions of adjacent characters). The smaller the distance, the more similar the two strings. We also use a normalized OSA distance that takes the lengths of the titles into account. Each potential duplicate was checked by eye, so as to capture most true positives and avoid false positives. Finally, consistent with the general approach of the project, we did not remove the duplicates but flagged them in a new column (duplicate_of). Table 5 shows an example of two duplicates.
Table 5: Example of two duplicates
Note

Our script can also handle duplicates identified manually. If you spot an undetected duplicate, please let us know.
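The duplicate detection step described in this section can be sketched in Python as follows (the actual script is written in R). The OSA recurrence is the standard one. Several choices are assumptions made for illustration and may differ from the script: the normalization by the length of the longer title, the 0.2 threshold, the lowercasing, and the input layout. Candidate pairs returned here would still need the manual check described above.

```python
from itertools import combinations


def osa_distance(a, b):
    """Optimal String Alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[rows - 1][cols - 1]


def normalized_osa(a, b):
    """OSA distance divided by the length of the longer string (assumed normalization)."""
    longest = max(len(a), len(b))
    return osa_distance(a, b) / longest if longest else 0.0


def candidate_duplicates(titles_by_author, threshold=0.2):
    """Compare all title pairs within each author group; the threshold is hypothetical."""
    pairs = []
    for author, titles in titles_by_author.items():
        for (id_a, title_a), (id_b, title_b) in combinations(titles, 2):
            if normalized_osa(title_a.lower(), title_b.lower()) <= threshold:
                pairs.append((author, id_a, id_b))
    return pairs


# Illustrative titles only.
titles_by_author = {
    "Dupont": [
        ("T1", "Essai sur la monnaie"),
        ("T2", "Essai sur la monaie"),
        ("T3", "Théorie du capital"),
    ],
}
candidates = candidate_duplicates(titles_by_author)
```

Grouping by author before comparing keeps the number of pairwise comparisons small, since only titles attached to the same author name are ever compared.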

Institutions

The 5_FR_cleaning_institution.R script aims to standardize and improve the quality of institution data.

Up to this step, all institution names mentioned in the metadata have been extracted and stored in a separate table. This script cleans and standardizes these names. Our goal is to replace the temporary institution identifiers (id_temp) created in 3_FR_merging_database.R with the official IdRef identifiers (id_ref) to ensure consistency and accuracy in the dataset. This replacement relies on matching the institution names and the thesis defense dates. The process accounts for historical changes in institutional structures (e.g., the splitting of the University of Paris after 1968), ensuring that ambiguous cases are handled carefully.

The core of the script is a manually defined table that associates regular expressions (regex) for institution names with their corresponding idref identifiers. This table also includes the dates of creation (date_of_birth) and dissolution (date_of_death) of institutions to set clear boundaries for replacement. For instance, if an institution’s name matches “University of Paris” and the thesis was defended before 1970, the identifier is replaced with that of the historic University of Paris, as it was the only university in Paris at the time.
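The matching logic can be sketched as follows in Python (the actual script is written in R). Everything specific here is hypothetical: the regular expressions, the placeholder idref values, the 1971 lower bound, and the convention that a rule applies to theses defended strictly before date_of_death. The sketch only illustrates the idea of a name pattern combined with a validity window.

```python
import re

# Hypothetical rule table: name pattern, placeholder idref, and validity window.
INSTITUTION_RULES = [
    {
        "pattern": re.compile(r"universit[eé] de paris$", re.IGNORECASE),
        "idref": "IDREF_PARIS_HISTORIC",  # placeholder, not a real idref
        "date_of_birth": None,
        "date_of_death": 1970,
    },
    {
        "pattern": re.compile(r"paris\s+(1|i)\b", re.IGNORECASE),
        "idref": "IDREF_PARIS_1",  # placeholder, not a real idref
        "date_of_birth": 1971,  # illustrative date only
        "date_of_death": None,
    },
]


def resolve_institution(name, year_defence, id_temp, rules=INSTITUTION_RULES):
    """Return the idref of the first matching rule, or keep the temporary identifier."""
    for rule in rules:
        born, died = rule["date_of_birth"], rule["date_of_death"]
        in_window = (born is None or year_defence >= born) and (
            died is None or year_defence < died
        )
        if in_window and rule["pattern"].search(name.strip()):
            return rule["idref"]
    # No rule applies: the name is unknown or ambiguous for this date.
    return id_temp


historic = resolve_institution("Université de Paris", 1955, "tmp_1")
ambiguous = resolve_institution("Université de Paris", 2022, "tmp_2")
paris_1 = resolve_institution("Université Paris I Panthéon-Sorbonne", 1990, "tmp_3")
```

Falling back to the temporary identifier when no rule applies keeps ambiguous names untouched instead of forcing a possibly wrong match.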

Warning

We kept the temporary identifier and did not assign an idref to an institution name when we were unable to resolve the ambiguity. For instance, we kept the temporary identifier if the thesis was defended in 2022 and the institution name is ambiguous (for instance, “Université de Paris” could be the University of Paris I Panthéon-Sorbonne or the University of Paris II Panthéon-Assas).

Individuals

The 6_FR_cleaning_persons.R script aims to standardize and improve the quality of person data.

First, this script adds information on individuals from the idref_person_table. When a named entity is associated with an idref identifier, the script adds the supplementary information on the person provided by the IdRef database (organization, birth date, relevant links such as Wikipedia pages, etc.). We also replace the raw names (found in SUDoc or Theses.fr) with the names provided by IdRef.

Second, we try to clean and unify person identifiers. We know that the same person can appear under slightly different names (e.g., “Jean A. Dupont” and “Jean Dupont”), or under the same name but with different identifiers. This is particularly important for persons present in both the SUDoc and Theses.fr databases. For instance, the same person can be the author of a thesis found in SUDoc in 1983 and a jury member of a thesis found in Theses.fr in 1999. Contrary to institutions, however, we are unable to disambiguate person identifiers because of the risk of homonyms. In other words, if two persons have the same name string, we cannot be sure that they are the same person: two authors of two different theses could share a name, or one person could have written both theses. We therefore create a new column, homonym_of, that groups potential homonyms. For each person, the variable homonym_of gives the list of identifiers of the persons who are their homonyms.
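As a rough illustration of how such a homonym_of column could be built, here is a minimal Python sketch (the actual script is written in R). The name normalization (lowercasing and stripping accents) and the sample identifiers are assumptions made for the example. The documentation only states that persons sharing the same name string are grouped.

```python
import unicodedata
from collections import defaultdict


def normalize_name(firstname, name):
    """Lowercase, strip accents and extra spaces (assumed normalization)."""
    text = unicodedata.normalize("NFKD", f"{firstname} {name}".lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return " ".join(text.split())


def add_homonyms(persons):
    """Return copies of the person rows with a homonym_of list of other identifiers."""
    groups = defaultdict(list)
    for person in persons:
        key = normalize_name(person["entity_firstname"], person["entity_name"])
        groups[key].append(person["entity_id"])

    result = []
    for person in persons:
        key = normalize_name(person["entity_firstname"], person["entity_name"])
        others = [pid for pid in groups[key] if pid != person["entity_id"]]
        result.append({**person, "homonym_of": others})
    return result


# Illustrative rows; the identifiers are hypothetical.
persons = [
    {"entity_id": "P1", "entity_name": "Delcey", "entity_firstname": "Thomas"},
    {"entity_id": "temp_42", "entity_name": "DELCEY", "entity_firstname": "Thomas"},
    {"entity_id": "P3", "entity_name": "Dupont", "entity_firstname": "Jean"},
]
with_homonyms = add_homonyms(persons)
```

The identifiers are never merged in this sketch: the column only points to candidates, which matches the choice of flagging rather than resolving potential duplicates.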

Warning

There is a significant risk that the same person has several identifiers in the database. Potential candidates can be spotted with the homonym_of variable. Currently, our script does not support manual disambiguation of homonyms. This is a feature we need to add in the future.

Improvements

Footnotes

  1. While focusing on France, the database and its documentation are in English, because this project is part of a broader initiative entitled Becoming an economist that aims at building a comprehensive database of Ph.D. theses in economics across the world.↩︎

  2. See the English description of the licence here.↩︎

  3. This corresponds to the reform of French PhD and the implementation of the “new regime”.↩︎